Abstract

Fisher's linear discriminant analysis (FLDA) is an important dimension reduction method in
statistical pattern recognition. It has been shown that FLDA is asymptotically Bayes optimal under the
homoscedastic Gaussian assumption. However, this classical result has two major limitations:
1) it holds only for a fixed dimensionality $D$, and thus does not apply when $D$ and the training
sample number $N$ are proportionally large; 2) it does not provide a quantitative description of the
performance of FLDA. In this paper, we present an asymptotic generalization analysis of FLDA based
on random matrix theory in the setting where both $D$ and $N$ increase and $D/N \to \gamma \in [0,1)$.
The obtained asymptotic generalization bound overcomes both limitations of the classical result, i.e.,
it is applicable when $D$ and $N$ are proportionally large and provides a quantitative description of the
generalization ability of FLDA in terms of the ratio $\gamma = D/N$ and the population discrimination power.

Index Terms 

Fisher's linear discriminant analysis (FLDA), asymptotic generalization analysis, random matrix 
theory 

I. Introduction 

Fisher's linear discriminant analysis (FLDA) [1] is one of the most representative dimension
reduction techniques in statistical pattern recognition [2]. By projecting examples into a
low-dimensional subspace with maximum discrimination power, FLDA helps improve the accuracy
and the robustness of a decision system [3], [4], [5], [6]. During the past decades, FLDA has been
applied to a wide range of areas, from speech/music classification [7], [8] and face recognition [9],
[10] to financial data analysis [11], [12].

An important property of FLDA is its asymptotic Bayes optimality under the homoscedastic
Gaussian assumption [13], [14], [15], which is a corollary of classical results from multivariate
statistics [16]. Specifically, as the training sample number $N$ goes to infinity, both the within-class
scatter matrix $\hat{\boldsymbol{\Sigma}}$ (sample covariance) and the between-class scatter matrix $\hat{\mathbf{S}}$ converge to their
population counterparts $\boldsymbol{\Sigma}$ and $\mathbf{S}$. Therefore, the empirically optimal projection matrix $\hat{\mathbf{W}}^*$
of FLDA, obtained by generalized eigendecomposition over $\hat{\boldsymbol{\Sigma}}$ and $\hat{\mathbf{S}}$, also converges to its
population counterpart $\mathbf{W}^*$. Thanks to the asymptotic Bayes optimality, we can expect an
acceptable performance of FLDA as long as $N$ is sufficiently large. However, this classical
result, i.e., the asymptotic Bayes optimality, suffers from two major limitations:

1) It is obtained by fixing the dimensionality $D$ and letting only $N$ increase to infinity. But
in practice, e.g., in face recognition, $D$ and $N$ can be proportionally large, which makes
the classical result inapplicable.

2) It does not provide a quantitative description of the performance of FLDA. Specifically,
given $D$ and $N$, we still do not know how good the performance of FLDA can be.

A. The Contribution of this Paper 

To address the aforementioned limitations of the classical result, in this paper we present an
asymptotic generalization analysis of FLDA. Our analysis improves on the classical result in two
respects. First, we modify the setting of analysis by allowing both $D$ and $N$ to increase and
assuming the ratio $D/N \to \gamma \in [0,1)$. This makes our result applicable in the case where $D$ and $N$
are proportionally large. Second, we quantitatively examine the generalization ability of FLDA.
Denoting by $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$ the generalization discrimination power of FLDA, we bound
it from below by using the population discrimination power $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*)$, $D$ and $N$.
Taking a binary-class problem as an example: suppose $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*) = \lambda$ and $\gamma = D/N$; then our
asymptotic generalization bound shows that $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$ is almost surely larger than

$$\cos^2\!\Big(\arccos\sqrt{\lambda/(\lambda+\gamma)} + \arccos\sqrt{1-\gamma}\Big)\,\lambda,$$

under mild conditions.

Based on the obtained asymptotic generalization bound, we can get better insight into FLDA.
First, it is commonly known, though not quantitatively described before, that the quality of
covariance estimation has a severe influence on the generalization ability of FLDA.
By assuming a sufficient population discrimination power so as to eliminate the influence of
between-class matrix estimation, we show that the influence of covariance estimation alone
is proportional to the ratio $\gamma = D/N$, i.e., due to the imperfection of covariance estimation,
$\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$ is about $1-\gamma$ times $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*)$. Besides, in multiclass cases, e.g., $c+1$
classes, empirical study has shown that the best dimensionality of FLDA can be less than $c$, which
implies that the generalization ability of the last few dimensions of FLDA's empirically optimal
projection matrix can be poor. This is also explained by the obtained asymptotic generalization
bound.

It is worth noticing that the setting $D/N \to \gamma \in [0,1)$ substantially implies $N > D$
when performing covariance estimation. In recent years, high-dimensional covariance estimation
where $D$ can be larger (or much larger) than $N$ has received considerable attention. In such cases,
regularized estimation is generally needed, e.g., sparse inverse covariance matrix estimation
[17], [18], thresholding estimation [19], banded estimation [20], and factor model
based estimation [21], to name a few. In the literature of FLDA, the case $N < D$ is usually
referred to as the small (or under) sample problem, and regularized discriminant analysis has also
been studied [22], [23], [24]. In this paper, we do not consider the case $N < D$, i.e., we do not
use any regularized estimation of the covariance matrix or regularized FLDA.

B. Tools 

The technical tools used in our asymptotic generalization analysis are from random matrix
theory (RMT) [25], [26], [27], [28], the main goal of which is to provide understanding of
the statistics of eigenvalues of matrices with entries drawn randomly from various probability
distributions. RMT was originally motivated by applications in nuclear physics in the 1950s, and
then it was intensively studied in mathematics and statistics. Recently, it has also found successful
applications in engineering fields, e.g., wireless communications [29]. In this paper, we make use of
two important results from RMT. The first one is the Marcenko-Pastur law [27], which states
that the empirical spectral distribution of a Wishart random matrix converges almost surely to
a deterministic distribution $F_\gamma(\lambda)$ as $D/N \to \gamma \in [0,1)$. The second one is the almost sure
convergence of the extreme singular values of a large Gaussian random matrix [28]. We formulate
these two results in the following propositions.

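Proposition 1: Given $\mathbf{G} \in \mathbb{R}^{D \times N}$, whose entries are independently sampled from the standard
Gaussian distribution $\mathcal{N}(0,1)$, then as both $D$ and $N \to \infty$ and $D/N \to \gamma \in [0,1)$, the
empirical distribution of the eigenvalues of $\frac{1}{N}\mathbf{G}\mathbf{G}^T$, i.e.,

$$F_N(\lambda) = \frac{1}{D}\sum_{i=1}^{D} \mathbb{1}\Big\{\lambda_i\Big(\frac{1}{N}\mathbf{G}\mathbf{G}^T\Big) \le \lambda\Big\}, \quad \lambda \ge 0, \tag{1}$$

converges almost surely to a deterministic limit distribution $F_\gamma(\lambda)$ with density

$$\frac{dF_\gamma(\lambda)}{d\lambda} = \frac{1}{2\pi\gamma\lambda}\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \quad \lambda_- \le \lambda \le \lambda_+, \tag{2}$$

where

$$\lambda_+ = (1+\sqrt{\gamma})^2 \quad\text{and}\quad \lambda_- = (1-\sqrt{\gamma})^2. \tag{3}$$

Proposition 2: Letting $\mathbf{G} \in \mathbb{R}^{D \times m}$ with i.i.d. entries sampled from $\mathcal{N}(0,1)$, then as
$m/D \to \gamma \in [0,1)$,

$$\frac{1}{\sqrt{D}}\,\sigma_{\max}(\mathbf{G}) \xrightarrow{\text{a.s.}} 1 + \sqrt{\gamma}, \tag{4}$$

and

$$\frac{1}{\sqrt{D}}\,\sigma_{\min}(\mathbf{G}) \xrightarrow{\text{a.s.}} 1 - \sqrt{\gamma}. \tag{5}$$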
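As a numerical illustration of Proposition 2, the following sketch draws one Gaussian matrix and compares its scaled extreme singular values with $1 \pm \sqrt{\gamma}$. The sizes `D = 1000` and `m = 250`, the seed, and the helper name `scaled_singular_values` are illustrative assumptions, not values from the paper; at finite size the agreement is only approximate.

```python
import numpy as np

def scaled_singular_values(D, m, seed=0):
    """Singular values of a D x m standard Gaussian matrix, divided by sqrt(D)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((D, m))            # i.i.d. N(0, 1) entries
    # numpy returns singular values sorted in descending order
    return np.linalg.svd(G, compute_uv=False) / np.sqrt(D)

D, m = 1000, 250                               # illustrative sizes (assumption)
gamma = m / D
s = scaled_singular_values(D, m)
s_max, s_min = s[0], s[-1]
eigs = s ** 2                                  # eigenvalues of G^T G / D

print("sigma_max / sqrt(D):", s_max, " limit:", 1 + np.sqrt(gamma))
print("sigma_min / sqrt(D):", s_min, " limit:", 1 - np.sqrt(gamma))
print("mean eigenvalue of G^T G / D:", eigs.mean())
```

The squared scaled singular values are the eigenvalues of $\frac{1}{D}\mathbf{G}^T\mathbf{G}$, so the same run also shows the spectrum concentrating on the interval $[\lambda_-, \lambda_+]$ of Proposition 1, with $m$ in the role of the dimension and $D$ in the role of the sample number.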



C. Notations 

Throughout this paper, we will use the following notations. A bold lower-case letter $\mathbf{a}$ denotes
a vector. A bold upper-case letter $\mathbf{A}$ denotes a matrix. $\mathbb{R}^D$ denotes a $D$-dimensional vector space.
$\mathbb{R}^{D_1 \times D_2}$ denotes the set of all $D_1$ by $D_2$ matrices. $\mathbf{A}_{ii}$ or $\{\mathbf{A}\}_{ii}$ denotes the $i$-th diagonal entry
of a symmetric matrix $\mathbf{A}$. $\mathbf{A}_i$ denotes the $i$-th column of $\mathbf{A}$. $\mathbf{A}_{1:c}$ denotes the matrix composed
of the first $c$ columns of $\mathbf{A}$. $\mathbb{S}^{D-1}$ denotes the unit sphere in $\mathbb{R}^D$ centered at the
origin. $\mathbb{S}_{++}^{D \times D}$ denotes the set of all $D$ by $D$ positive definite matrices. $\|\mathbf{a}\|$ denotes the $\ell_2$ norm
of $\mathbf{a}$. $\sigma_{\max}(\mathbf{A})$ and $\sigma_{\min}(\mathbf{A})$ are the extreme singular values of $\mathbf{A}$. $\|\mathbf{A}\| = \sigma_{\max}(\mathbf{A})$ denotes the
operator norm of $\mathbf{A}$. $\lambda_i(\mathbf{A})$ denotes the $i$-th eigenvalue of $\mathbf{A}$, sorted in descending order. $\boldsymbol{\Lambda}(\mathbf{A})$
denotes the diagonal matrix composed of the eigenvalues of $\mathbf{A}$, with the eigenvalues sorted in
descending order. $\mathcal{R}(\mathbf{A})$ denotes an orthonormal basis of the range or the column space of $\mathbf{A}$.
$[\mathbf{e}_1, \ldots, \mathbf{e}_D]$ is the canonical basis of $\mathbb{R}^D$.

II. Preliminary 

A. Population Discrimination Power 

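Suppose we have $c+1$ classes, represented by homoscedastic Gaussian distributions
$\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$, $i = 1, 2, \ldots, c+1$, in a high-dimensional space $\mathbb{R}^D$, with class means
$\boldsymbol{\mu}_i \in \mathbb{R}^D$ and the common covariance matrix $\boldsymbol{\Sigma} \in \mathbb{S}_{++}^{D \times D}$. Assuming the classes have equal
prior probability¹, the following matrix $\mathbf{S}$, which is referred to as the between-class scatter
matrix, gives a measure of class separation,

$$\mathbf{S} = \frac{1}{c+1}\sum_{i=1}^{c+1} (\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^T, \quad\text{with}\quad \boldsymbol{\mu} = \frac{1}{c+1}\sum_{i=1}^{c+1} \boldsymbol{\mu}_i. \tag{6}$$

Given a projection matrix $\mathbf{W} \in \mathbb{R}^{D \times d}$, the linear transformation $\mathbf{z} = \mathbf{W}^T\mathbf{x}$ reduces the
dimensionality from $D$ to $d$. According to Fisher's criterion [30], [3], the discrimination power in the
dimension-reduced space is given by

$$\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}) = \mathrm{Tr}\big((\mathbf{W}^T\boldsymbol{\Sigma}\mathbf{W})^{-1}\mathbf{W}^T\mathbf{S}\mathbf{W}\big). \tag{7}$$

Therefore, if the population parameters $\boldsymbol{\Sigma}$ and $\mathbf{S}$ are known, the optimal projection matrix $\mathbf{W}^*$
can be obtained by

$$\mathbf{W}^* = \arg\max_{\mathbf{W} \in \mathbb{R}^{D \times d}} \Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}). \tag{8}$$

Note that (6) implies $\mathrm{rank}(\mathbf{S}) \le c$. Without loss of generality, we assume $\mathrm{rank}(\mathbf{S}) = c$, i.e.,
the class means of the $c+1$ classes span a $c$-dimensional hyperplane in $\mathbb{R}^D$. Then, for (8) it is
sufficient to restrict $\mathbf{W} \in \mathbb{R}^{D \times c}$. Besides, by the property of the trace operator, (7) is invariant
to the transformation $\mathbf{W} \leftarrow \mathbf{W}\mathbf{A}$, with $\mathbf{A} \in \mathbb{R}^{c \times c}$ being any nonsingular matrix. Thus, we can
further require $\mathbf{W}^T\boldsymbol{\Sigma}\mathbf{W} = \mathbf{I}_c$, which does not affect (8). As a result, we have

$$\mathbf{W}^* = \arg\max_{\mathbf{W}^T\boldsymbol{\Sigma}\mathbf{W} = \mathbf{I}_c} \Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}). \tag{9}$$

¹For the convenience of expression, we assume an equal prior probability. This does not substantially change the analysis
throughout this paper.

Given the optimal projection matrix $\mathbf{W}^*$, the quantity $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*)$ preserves the discrimination
power among the $c+1$ classes in the original space $\mathbb{R}^D$, and we refer to it as the population
discrimination power.

By using simultaneous diagonalization [30], we have the following proposition on $\mathbf{W}^*$ and
$\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*)$. Please refer to [30] for the proof of simultaneous diagonalization.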

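Proposition 3: There exists a nonsingular matrix $\mathbf{X}^* = [\mathbf{W}^*, \mathbf{V}^*]$, with $\mathbf{W}^* \in \mathbb{R}^{D \times c}$ and
$\mathbf{V}^* \in \mathbb{R}^{D \times (D-c)}$, that simultaneously diagonalizes $\boldsymbol{\Sigma}$ and $\mathbf{S}$, i.e.,

$$\mathbf{X}^{*T}\boldsymbol{\Sigma}\mathbf{X}^* = \mathbf{I} \quad\text{and}\quad \mathbf{X}^{*T}\mathbf{S}\mathbf{X}^* = \boldsymbol{\Lambda}, \tag{10}$$

where $\boldsymbol{\Lambda}$ is a diagonal matrix, with only the first $c$ diagonal entries being nonzero. Further,

$$\mathbf{X}^* = \boldsymbol{\Sigma}^{-1/2}\mathbf{U}^* \quad\text{and}\quad \mathbf{W}^* = \boldsymbol{\Sigma}^{-1/2}\mathbf{U}^*_{1:c}, \tag{11}$$

where $\mathbf{U}^*$ is from the eigendecomposition $\boldsymbol{\Sigma}^{-1/2}\mathbf{S}\boldsymbol{\Sigma}^{-1/2} = \mathbf{U}^*\boldsymbol{\Lambda}\mathbf{U}^{*T}$; and

$$\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*) = \sum_{i=1}^{c} \lambda_i, \tag{12}$$

where $\lambda_i$, $i = 1, 2, \ldots, c$, are the first $c$ diagonal entries of $\boldsymbol{\Lambda}$.

In fact, $\lambda_i$, $i = 1, 2, \ldots, c$, are also the nonzero eigenvalues of the generalized
eigendecomposition $\mathbf{S}\boldsymbol{\xi} = \lambda\boldsymbol{\Sigma}\boldsymbol{\xi}$, and $\mathbf{W}^*$ and $\mathbf{V}^*$ span the two invariant subspaces associated with
the nonzero and zero eigenvalues, respectively.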
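The construction in Proposition 3 can be checked numerically. The sketch below builds $\mathbf{X}^* = \boldsymbol{\Sigma}^{-1/2}\mathbf{U}^*$ for a small random problem and verifies (10) and (12); the problem sizes, the random $\boldsymbol{\Sigma}$ and class means, and the function name `simultaneous_diagonalization` are assumptions made for illustration only.

```python
import numpy as np

def simultaneous_diagonalization(Sigma, S):
    """Return X and lam with X^T Sigma X = I and X^T S X = diag(lam), lam descending."""
    # Sigma^{-1/2} from the symmetric eigendecomposition of Sigma
    w, Q = np.linalg.eigh(Sigma)
    Sigma_inv_half = (Q / np.sqrt(w)) @ Q.T
    # eigendecomposition of Sigma^{-1/2} S Sigma^{-1/2}
    lam, U = np.linalg.eigh(Sigma_inv_half @ S @ Sigma_inv_half)
    order = np.argsort(lam)[::-1]              # sort eigenvalues in descending order
    lam, U = lam[order], U[:, order]
    return Sigma_inv_half @ U, lam

rng = np.random.default_rng(0)
D, c = 6, 2                                    # illustrative sizes (assumption)
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)                # a positive definite covariance
mu = rng.standard_normal((c + 1, D))           # c + 1 class means
centered = mu - mu.mean(axis=0)
S = centered.T @ centered / (c + 1)            # between-class scatter (6), rank c

X, lam = simultaneous_diagonalization(Sigma, S)
W = X[:, :c]                                   # W* = first c columns of X*
power = np.trace(np.linalg.solve(W.T @ Sigma @ W, W.T @ S @ W))   # criterion (7)
print("lambda:", lam[:c], " Delta(Sigma, S | W*):", power)
```

Only the first $c$ entries of `lam` are nonzero up to rounding, and their sum equals the value of the criterion (7) at $\mathbf{W}^*$.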

B. Generalization Discrimination Power 

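In practice, we do not have access to the population parameters $\boldsymbol{\Sigma}$ and $\mathbf{S}$, but are given a set
of training examples. Suppose there are $n$ examples $\mathbf{x}_j^i$ for each class, $i = 1, 2, \ldots, c+1$,
$j = 1, 2, \ldots, n$, and in total $N = (c+1)n$ training examples for all classes. The empirical estimates
of $\boldsymbol{\Sigma}$ and $\mathbf{S}$ are given by

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{c+1}\sum_{j=1}^{n} (\mathbf{x}_j^i - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_j^i - \hat{\boldsymbol{\mu}}_i)^T, \tag{13}$$

$$\hat{\mathbf{S}} = \frac{1}{c+1}\sum_{i=1}^{c+1} (\hat{\boldsymbol{\mu}}_i - \hat{\boldsymbol{\mu}})(\hat{\boldsymbol{\mu}}_i - \hat{\boldsymbol{\mu}})^T, \tag{14}$$

where

$$\hat{\boldsymbol{\mu}}_i = \frac{1}{n}\sum_{j=1}^{n} \mathbf{x}_j^i \quad\text{and}\quad \hat{\boldsymbol{\mu}} = \frac{1}{c+1}\sum_{i=1}^{c+1} \hat{\boldsymbol{\mu}}_i. \tag{15}$$

Under the condition $N > D$, it holds almost surely that $\hat{\boldsymbol{\Sigma}} \in \mathbb{S}_{++}^{D \times D}$ and $\mathrm{rank}(\hat{\mathbf{S}}) = c$. Then, analogous
to (9), the empirically optimal projection matrix $\hat{\mathbf{W}}^*$ of FLDA is given by

$$\hat{\mathbf{W}}^* = \arg\max_{\mathbf{W}^T\hat{\boldsymbol{\Sigma}}\mathbf{W} = \mathbf{I}_c} \Delta(\hat{\boldsymbol{\Sigma}}, \hat{\mathbf{S}} \mid \mathbf{W}). \tag{16}$$

The performance of $\hat{\mathbf{W}}^*$ can be evaluated by examining the generalization discrimination
power $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$. We propose to compare $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$ with its population counterpart
$\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*)$. First, the following lemma gives an exact expression of $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$.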

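Lemma 1: Let

$$\hat{\boldsymbol{\Sigma}}_0 = \mathbf{X}^{*T}\hat{\boldsymbol{\Sigma}}\mathbf{X}^* \quad\text{and}\quad \hat{\mathbf{S}}_0 = \mathbf{X}^{*T}\hat{\mathbf{S}}\mathbf{X}^*, \tag{17}$$

where $\mathbf{X}^*$ is from Proposition 3. Given the eigendecompositions $\hat{\boldsymbol{\Sigma}}_0 = \mathbf{U}\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T$ and
$\hat{\mathbf{S}}_0 = \mathbf{V}\boldsymbol{\Lambda}(\hat{\mathbf{S}}_0)\mathbf{V}^T$, it holds

$$\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*) = \sum_{i=1}^{c} \delta_i\lambda_i, \tag{18}$$

where

$$\delta_i = \big\|\mathcal{R}^T\big(\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\big)\,\mathbf{U}^T\mathbf{e}_i\big\|^2. \tag{19}$$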



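From Lemma 1, we have the following observations on the generalization ability of FLDA:

1) Given the population discrimination power $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*) = \sum_{i=1}^{c}\lambda_i$, the generalization
discrimination power $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$ is exactly determined by $\delta_i$, $i = 1, 2, \ldots, c$. According
to (19), $\delta_i$ is affected by the eigenvalues and eigenvectors of the "normalized" estimators
$\hat{\boldsymbol{\Sigma}}_0$ and $\hat{\mathbf{S}}_0$ rather than $\hat{\boldsymbol{\Sigma}}$ and $\hat{\mathbf{S}}$. Since $\mathbf{X}^{*T}\boldsymbol{\Sigma}\mathbf{X}^* = \mathbf{I}$, (17) implies that $\hat{\boldsymbol{\Sigma}}_0$ is an empirical
estimate of the identity covariance matrix $\mathbf{I}$. Thus, fixing $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*) = \sum_{i=1}^{c}\lambda_i$, the
generalization ability of FLDA, i.e., $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$, is independent of the population covariance
$\boldsymbol{\Sigma}$. This is very important since it allows us to get rid of the covariance structure, especially
the condition number $\lambda_{\max}(\boldsymbol{\Sigma})/\lambda_{\min}(\boldsymbol{\Sigma})$ or the minimum eigenvalue $\lambda_{\min}(\boldsymbol{\Sigma})$, when
conducting generalization analysis.

2) The eigenvalues and eigenvectors of $\hat{\boldsymbol{\Sigma}}_0$ and $\hat{\mathbf{S}}_0$ play the key role in evaluating $\delta_i$ or
$\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$. Therefore, deriving the properties of $\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)$, $\mathbf{U}$ and $\mathbf{V}_{1:c}$ becomes the main
task in the asymptotic generalization analysis later.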

III. Properties of the Normalized Estimators 

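We have known from Lemma 1 that $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*) = \sum_{i=1}^{c}\delta_i\lambda_i$, wherein $\delta_i$ is exactly determined
by the eigenvalues and eigenvectors of $\hat{\boldsymbol{\Sigma}}_0$ and $\hat{\mathbf{S}}_0$. In this section, we present useful lemmas on
properties of these eigenvalues and eigenvectors. First, according to (10) and (17), we have the
following proposition on $\hat{\boldsymbol{\Sigma}}_0$ and $\hat{\mathbf{S}}_0$.

Proposition 4: Given $\mathbf{X}^*$ that simultaneously diagonalizes $\boldsymbol{\Sigma}$ and $\mathbf{S}$, i.e., $\mathbf{X}^{*T}\boldsymbol{\Sigma}\mathbf{X}^* = \mathbf{I}$ and
$\mathbf{X}^{*T}\mathbf{S}\mathbf{X}^* = \boldsymbol{\Lambda}$, then $\hat{\boldsymbol{\Sigma}}_0 = \mathbf{X}^{*T}\hat{\boldsymbol{\Sigma}}\mathbf{X}^*$ and $\hat{\mathbf{S}}_0 = \mathbf{X}^{*T}\hat{\mathbf{S}}\mathbf{X}^*$ are the corresponding estimates of $\mathbf{I}$
and $\boldsymbol{\Lambda}$, respectively.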

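A. Properties of $\hat{\boldsymbol{\Sigma}}_0$

First, we have the following lemma on the eigenvalues and eigenvectors of $\hat{\boldsymbol{\Sigma}}_0$.

Lemma 2: Given the eigendecomposition $\hat{\boldsymbol{\Sigma}}_0 = \mathbf{U}\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T$, it holds

1) $\mathbf{U}$ and $\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)$ are independent random variables;

2) $\mathbf{U}$ follows the Haar distribution, i.e., it is uniformly distributed on the set of all orthonormal
matrices in $\mathbb{R}^{D \times D}$;

3) denoting by $F_N(\lambda)$ the empirical spectral distribution of the eigenvalues of $\hat{\boldsymbol{\Sigma}}_0$, i.e.,

$$F_N(\lambda) = \frac{1}{D}\sum_{i=1}^{D} \mathbb{1}\{\lambda_i(\hat{\boldsymbol{\Sigma}}_0) \le \lambda\}, \quad \lambda \ge 0, \tag{20}$$

then, as $D/N \to \gamma \in [0,1)$,

$$F_N(\lambda) \xrightarrow{\text{a.s.}} F_\gamma(\lambda), \tag{21}$$

where the limit distribution $F_\gamma(\lambda)$ has the density

$$\frac{dF_\gamma(\lambda)}{d\lambda} = \frac{1}{2\pi\gamma\lambda}\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \quad \lambda_- \le \lambda \le \lambda_+, \tag{22}$$

with

$$\lambda_+ = (1+\sqrt{\gamma})^2 \quad\text{and}\quad \lambda_- = (1-\sqrt{\gamma})^2. \tag{23}$$

The first and second statements in Lemma 2 can be understood by the fact that $\hat{\boldsymbol{\Sigma}}_0$ is the
empirical estimate of $\mathbf{I}$, whose probability density is invariant to any orthogonal transformation.
The last statement is a corollary of the Marcenko-Pastur law, i.e., Proposition 1, which says
that the empirical spectral distribution of the matrix $\frac{1}{N}\mathbf{G}\mathbf{G}^T$, wherein $\mathbf{G} \in \mathbb{R}^{D \times N}$ has i.i.d.
entries sampled from $\mathcal{N}(0,1)$, converges almost surely to the deterministic distribution $F_\gamma(\lambda)$ as
$D/N \to \gamma \in [0,1)$.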



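Further, due to the inverse operation in (19), we need the following lemma on $\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)$ and
$\boldsymbol{\Lambda}^{-2}(\hat{\boldsymbol{\Sigma}}_0)$, which says that the energy of $\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)$ and $\boldsymbol{\Lambda}^{-2}(\hat{\boldsymbol{\Sigma}}_0)$ projected onto a random direction
is almost surely deterministic in the limit.

Lemma 3: Suppose $\boldsymbol{\xi}$ is a unit-length random vector uniformly distributed on the unit sphere
$\mathbb{S}^{D-1}$ and independent of $\hat{\boldsymbol{\Sigma}}_0$; then, as $D/N \to \gamma \in [0,1)$,

$$\boldsymbol{\xi}^T\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\boldsymbol{\xi} \xrightarrow{\text{a.s.}} \int \lambda^{-1}\,dF_\gamma(\lambda) = \frac{1}{1-\gamma}, \tag{24}$$

and

$$\boldsymbol{\xi}^T\boldsymbol{\Lambda}^{-2}(\hat{\boldsymbol{\Sigma}}_0)\boldsymbol{\xi} \xrightarrow{\text{a.s.}} \int \lambda^{-2}\,dF_\gamma(\lambda) = \frac{1}{(1-\gamma)^3}. \tag{25}$$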
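The two limits in Lemma 3 can be sanity-checked by integrating the Marcenko-Pastur density numerically. The sketch below is only an illustration: it uses the change of variable $\lambda = 1 + \gamma - 2\sqrt{\gamma}\cos x$, $x \in [0, \pi]$, under which $dF_\gamma(\lambda) = \frac{2}{\pi}\frac{\sin^2 x}{\lambda}\,dx$, together with a plain midpoint rule. The function name `mp_inverse_moment` and the grid size are assumptions.

```python
import math

def mp_inverse_moment(gamma, p, n=20000):
    """Midpoint-rule approximation of the integral of lambda^{-p} dF_gamma(lambda).

    With lambda = 1 + gamma - 2*sqrt(gamma)*cos(x), x in [0, pi], the
    Marcenko-Pastur law becomes dF_gamma = (2/pi) * sin(x)^2 / lambda dx.
    """
    h = math.pi / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        lam = 1.0 + gamma - 2.0 * math.sqrt(gamma) * math.cos(x)
        total += math.sin(x) ** 2 / lam ** (p + 1)
    return 2.0 / math.pi * total * h

for gamma in (0.1, 0.5, 0.8):
    m1 = mp_inverse_moment(gamma, 1)   # compare with 1 / (1 - gamma)
    m2 = mp_inverse_moment(gamma, 2)   # compare with 1 / (1 - gamma)^3
    print(gamma, m1, 1 / (1 - gamma), m2, 1 / (1 - gamma) ** 3)
```

Calling `mp_inverse_moment(gamma, 0)` returns the total mass of $F_\gamma$, which gives a quick check that the density is normalized.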



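B. Properties of $\hat{\mathbf{S}}_0$

We have the following lemma on the first $c$ eigenvectors of $\hat{\mathbf{S}}_0$.

Lemma 4: Given the eigendecomposition $\hat{\mathbf{S}}_0 = \mathbf{V}\boldsymbol{\Lambda}(\hat{\mathbf{S}}_0)\mathbf{V}^T$, then as $D/N \to \gamma \in [0,1)$,

$$\lim_{D/N \to \gamma} \|\mathbf{V}_{1:c}^T\mathbf{e}_i\|^2 \ge \frac{\lambda_i}{\lambda_i + \gamma}, \quad \text{a.s.}, \quad i = 1, 2, \ldots, c, \tag{26}$$

where $\lambda_i$ is the $i$-th diagonal entry of $\boldsymbol{\Lambda}$ in Proposition 3.

Note that the first $c$ eigenvectors of $\boldsymbol{\Lambda}$ are $\mathbf{I}_{1:c} = [\mathbf{e}_1, \ldots, \mathbf{e}_c]$. Thus, from the relationship
between $\hat{\mathbf{S}}_0$ and $\boldsymbol{\Lambda}$, $\mathbf{V}_{1:c}$ is actually an estimate of $\mathbf{I}_{1:c}$. Lemma 4 describes the performance of
this estimate by using $\lambda_i$ and $\gamma$. Specifically, as $\lambda_i/(\lambda_i + \gamma)$ approaches 1, $\mathbf{e}_i$ becomes more
nearly contained in the range of $\mathbf{V}_{1:c}$.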
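For the binary case ($c = 1$) with $\boldsymbol{\Sigma} = \mathbf{I}$, the quantity in Lemma 4 is easy to simulate, because the estimated between-class scatter has rank one and its leading eigenvector is the normalized difference of the two class sample means. The sketch below is a minimal illustration under those assumptions; it draws the sample means directly from their Gaussian distribution instead of generating raw examples, and the sizes, seed and name `leading_alignment` are hypothetical.

```python
import math
import random

def leading_alignment(lam, D, N, seed=0):
    """Binary case with Sigma = I: (V_1^T e_1)^2 for one draw of the class sample means."""
    rng = random.Random(seed)
    n = N // 2                          # examples per class
    a = math.sqrt(lam)                  # class means +a*e_1 and -a*e_1 give lambda_1 = a^2
    sd = 1.0 / math.sqrt(n)             # each sample-mean coordinate has variance 1/n
    # difference of the two class sample means; it spans the range of the estimated S
    d = [rng.gauss(0.0, sd) - rng.gauss(0.0, sd) for _ in range(D)]
    d[0] += 2.0 * a
    return d[0] ** 2 / sum(v * v for v in d)

lam, D, N = 1.0, 2000, 4000             # illustrative sizes (assumption)
gamma = D / N
observed = leading_alignment(lam, D, N)
predicted = lam / (lam + gamma)         # right-hand side of (26)
print(observed, predicted)
```

In this simple setting the simulated value lands close to $\lambda_1/(\lambda_1 + \gamma)$, with small finite-size fluctuations on either side.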

IV. Asymptotic Generalization Bound 

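In this section, we prove our main result, which is an asymptotic lower bound of the
generalization discrimination power $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$. Recall the result in Lemma 1, i.e.,
$\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*) = \sum_{i=1}^{c}\delta_i\lambda_i$. We first present a lower bound of $\delta_i$.

Lemma 5: Given the eigenvalues $\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)$ of $\hat{\boldsymbol{\Sigma}}_0$ and the first $c$ eigenvectors $\mathbf{V}_{1:c}$ of $\hat{\mathbf{S}}_0$, it holds

$$\delta_i \ge \max{}^{2}\{\cos(\theta), 0\}, \tag{27}$$

where

$$\theta = \arccos\big(\|\mathbf{V}_{1:c}^T\mathbf{e}_i\|\big) + \arccos\left(\frac{\boldsymbol{\xi}^T\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\boldsymbol{\xi}}{\sqrt{\boldsymbol{\xi}^T\boldsymbol{\Lambda}^{-2}(\hat{\boldsymbol{\Sigma}}_0)\boldsymbol{\xi}}}\right), \tag{28}$$

with $\boldsymbol{\xi}$ being a unit-length random vector uniformly distributed on the unit sphere $\mathbb{S}^{D-1}$.

Then by Lemmas 3, 4 and 5, we have the following theorem on the generalization
discrimination power $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*)$.

Theorem 1: Suppose the population discrimination power is given by $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \mathbf{W}^*) = \sum_{i=1}^{c}\lambda_i$
and $\hat{\mathbf{W}}^*$ is the empirically optimal projection matrix of FLDA. For the generalization
discrimination power $\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*) = \sum_{i=1}^{c}\delta_i\lambda_i$, as both the dimensionality $D$ and the training sample
number $N$ increase ($N > D$) and $D/N \to \gamma \in [0,1)$, it holds almost surely

$$\delta_i \ge \eta_i = \max{}^{2}\Big\{\cos\Big(\arccos\sqrt{\lambda_i/(\lambda_i+\gamma)} + \arccos\sqrt{1-\gamma}\Big), 0\Big\}. \tag{29}$$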

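Proof: By Lemma 3, we have

$$\lim_{D/N \to \gamma} \frac{\boldsymbol{\xi}^T\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\boldsymbol{\xi}}{\sqrt{\boldsymbol{\xi}^T\boldsymbol{\Lambda}^{-2}(\hat{\boldsymbol{\Sigma}}_0)\boldsymbol{\xi}}} = \frac{1/(1-\gamma)}{\sqrt{1/(1-\gamma)^3}} = \sqrt{1-\gamma}, \quad \text{a.s.} \tag{30}$$

By Lemma 4, we have

$$\lim_{D/N \to \gamma} \|\mathbf{V}_{1:c}^T\mathbf{e}_i\| \ge \sqrt{\frac{\lambda_i}{\lambda_i+\gamma}}, \quad \text{a.s.} \tag{31}$$

Then the proof is completed by substituting (30) and (31) into Lemma 5.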



From Theorem 1, we have the following observations:

1) In the limit, the lower bound $\eta_i$ of $\delta_i$, $i = 1, 2, \ldots, c$, is determined by the population
discrimination power $\lambda_i$ and the dimensionality to training sample number ratio $D/N$.
Figure 1 shows the lower bound $\eta_i$ as a function of $\gamma = D/N$ and $\lambda_i$.

2) The influence of $\gamma = D/N$ on $\eta_i$ comes from two aspects, through the term $\sqrt{\lambda_i/(\lambda_i+\gamma)}$
and the term $\sqrt{1-\gamma}$, respectively. Note that $\sqrt{\lambda_i/(\lambda_i+\gamma)}$ allows a tradeoff between $\lambda_i$ and $\gamma$, i.e.,
the effect of a large $\gamma$ can be relatively reduced by a large $\lambda_i$. This is consistent
with the intuition that a problem with a larger population discrimination power, i.e., $\lambda_i$,
should be easier for empirical learning. The second term $\sqrt{1-\gamma}$ is only related to $\gamma$, and
according to (30), it comes from $\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)$, i.e., the eigenvalues of the normalized sample
covariance. It describes how covariance estimation influences the generalization ability of
FLDA. By assuming a sufficiently large $\lambda_i$ so that $\sqrt{\lambda_i/(\lambda_i+\gamma)} \approx 1$, we have

$$\eta_i \approx 1 - \gamma, \tag{32}$$

which shows that given the dimensionality to training sample number ratio $\gamma = D/N$,
the loss of discrimination power due to the imperfection of covariance estimation is
approximately $\gamma$. To the best of our knowledge, this is the first quantitative result on
the influence of covariance estimation on FLDA, although the influence itself has been commonly
noticed in the literature.

3) Fixing the dimensionality to training sample number ratio $\gamma = D/N$, the multiple lower
bounds $\eta_i$, $i = 1, 2, \ldots, c$, are individually determined by the corresponding $\lambda_i$, $i = 1, 2, \ldots, c$.
According to (29), a smaller $\lambda_i$ leads to a smaller $\eta_i$. Thus, the last few dimensions of
FLDA's empirically optimal projection matrix, corresponding to small $\lambda_i$, may have poor
generalization ability. This explains why in practice the best dimensionality of FLDA for
a $(c+1)$-class problem can be less than $c$.
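Observations 2) and 3) can be reproduced by evaluating (29) directly. The following is a minimal sketch; the function name `eta`, the choice $\gamma = 0.3$ and the very large $\lambda$ used to mimic the regime of (32) are illustrative assumptions, while the four smaller eigenvalues are those of Example 3 in Section V.

```python
import math

def eta(lam, gamma):
    """Asymptotic lower bound (29) on delta_i, for lam > 0 and 0 <= gamma < 1."""
    theta = (math.acos(math.sqrt(lam / (lam + gamma)))
             + math.acos(math.sqrt(1.0 - gamma)))
    return max(math.cos(theta), 0.0) ** 2

gamma = 0.3                                   # illustrative ratio D/N (assumption)
print("large lambda:", eta(1e6, gamma), " vs 1 - gamma =", 1 - gamma)
for lam in (10.0, 2.0, 1.0, 0.5):             # eigenvalues of Example 3
    print("lambda =", lam, " eta =", eta(lam, gamma))
```

For a fixed $\gamma$ the printed bounds shrink as $\lambda_i$ decreases, which is the effect described in observation 3).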



V. Empirical Evaluations 

In this section, we present experiments on both synthetic and real datasets to evaluate the
validity of the obtained asymptotic generalization bound. According to Theorem 1, comprehensive
evaluations involve comparisons between $\delta_i$ and $\eta_i$ under different settings of $\lambda_i$, $D$, and $N$ (or
$\gamma = D/N$). Recall that

$$\delta_i = \big\|\mathcal{R}^T\big(\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\big)\,\mathbf{U}^T\mathbf{e}_i\big\|^2, \tag{33}$$

wherein $\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)$ and $\mathbf{U}$ are the eigenvalues and the eigenvectors of $\hat{\boldsymbol{\Sigma}}_0$, and $\mathbf{V}_{1:c}$ are the first
$c$ eigenvectors of $\hat{\mathbf{S}}_0$. Since $\hat{\boldsymbol{\Sigma}}_0 = \mathbf{X}^{*T}\hat{\boldsymbol{\Sigma}}\mathbf{X}^*$ and $\hat{\mathbf{S}}_0 = \mathbf{X}^{*T}\hat{\mathbf{S}}\mathbf{X}^*$, we need $\mathbf{X}^*$, and therefore the
population parameters $\boldsymbol{\Sigma}$ and $\mathbf{S}$, to calculate $\delta_i$. For the synthetic data case, we can specify these
population parameters. But for the real data case, they are unknown. Therefore, we choose real
datasets with a sufficiently large number of examples, i.e., $N \gg D$, and treat the estimates with
the entire dataset as "population" parameters. In addition, $\delta_i$ is a random variable, and thus we
do Monte Carlo experiments to obtain its realizations. As for

$$\eta_i = \max{}^{2}\Big\{\cos\Big(\arccos\sqrt{\lambda_i/(\lambda_i+\gamma)} + \arccos\sqrt{1-\gamma}\Big), 0\Big\}, \tag{34}$$

it is a deterministic quantity related to $\lambda_i$ and $\gamma$. We vary $\lambda_i$ and $\gamma$ so as to evaluate $\eta_i$ in
different situations.



A. On Synthetic Datasets 

We design three examples for synthetic data evaluation, by varying the class number $c+1$, the
population discrimination power $\lambda_i$, $i = 1, 2, \ldots, c$, and the dimensionality $D$. Specifically, these
parameters are listed below:

• Example 1: $c + 1 = 2$, $\lambda_1 = 1$, $D \in \{10, 50, 100, 200\}$.



August 16, 2012 



DRAFT 






• Example 2: $c + 1 = 2$, $\lambda_1 = 10$, $D \in \{10, 50, 100, 200\}$.

• Example 3: $c + 1 = 5$, $\lambda_1 = 10$, $\lambda_2 = 2$, $\lambda_3 = 1$, $\lambda_4 = 0.5$, $D = 100$.

Since it has been shown that the covariance structure does not affect the generalization ability
of FLDA, we fix the population covariance as $\boldsymbol{\Sigma} = \mathbf{I}$. Therefore, in each experiment, training
examples are sampled from $\mathcal{N}(\boldsymbol{\mu}_i, \mathbf{I})$, $i = 1, \ldots, c+1$, where the $\boldsymbol{\mu}_i$ are chosen such that they give
the specified population discrimination power $\lambda_i$. We vary the dimensionality to training sample
number ratio $\gamma = D/N$ from 0 to 1, and for each value of $\gamma$ we run 10,000 independent
trials to obtain realizations of $\delta_i$. Finally, we calculate the asymptotic lower bound $\eta_i$ and
compare it with the scatter of $\delta_i$.
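A scaled-down version of this procedure for the binary case can be sketched as follows. It is not the authors' code: it fixes a single ratio $\gamma = 0.5$, uses 20 trials instead of 10,000, and relies on the facts that for $c = 1$ the empirically optimal direction is proportional to $\hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2)$ and that, with $\boldsymbol{\Sigma} = \mathbf{I}$ and $\mathbf{S} = \lambda_1\mathbf{e}_1\mathbf{e}_1^T$, $\delta_1 = w_1^2/\|\mathbf{w}\|^2$. The names `delta_binary` and `eta` are assumptions.

```python
import math
import numpy as np

def delta_binary(lam, D, N, rng):
    """One Monte Carlo realization of delta_1 for two Gaussian classes with Sigma = I."""
    n = N // 2
    mu = np.zeros(D)
    mu[0] = math.sqrt(lam)                     # class means +mu and -mu give lambda_1 = lam
    X1 = rng.standard_normal((n, D)) + mu
    X2 = rng.standard_normal((n, D)) - mu
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    R = np.vstack([X1 - m1, X2 - m2])          # examples centered by their class means
    Sigma_hat = R.T @ R / N                    # within-class scatter, as in (13)
    w = np.linalg.solve(Sigma_hat, m1 - m2)    # empirically optimal direction for c = 1
    return w[0] ** 2 / (w @ w)                 # Delta(Sigma, S | w) / lam

def eta(lam, gamma):
    """Asymptotic lower bound (29)."""
    theta = (math.acos(math.sqrt(lam / (lam + gamma)))
             + math.acos(math.sqrt(1.0 - gamma)))
    return max(math.cos(theta), 0.0) ** 2

rng = np.random.default_rng(0)
lam, D, N = 1.0, 100, 200                      # Example 1 with D = 100 and gamma = 0.5
deltas = np.array([delta_binary(lam, D, N, rng) for _ in range(20)])
bound = eta(lam, D / N)
print("min delta:", deltas.min(), " mean delta:", deltas.mean(), " bound:", bound)
```

Sweeping `N` for a fixed `D` and plotting `deltas` against `eta` would give a rough analogue of the scatter plots reported for the synthetic examples.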




Fig. 2. Evaluation of the asymptotic generalization bound on Example 1. (Horizontal axis: $\gamma$.)

Fig. 3. Evaluation of the asymptotic generalization bound on Example 2. (Horizontal axis: $\gamma$.)



The results of evaluations on the three synthetic data examples are shown in Figures 2 to 4,
from which we have the following observations:

1) The shape of the $\eta_i$ curve fits the scatter of $\delta_i$ well, which indicates the tightness of the
asymptotic generalization bound. Though hard to prove, we think the tightness is due to
the deterministic character of the bound, i.e., when $D$ is sufficiently large the bound holds
almost surely rather than in a probabilistic sense.

2) Theoretically, the asymptotic generalization bound, i.e., $\delta_i \ge \eta_i$, holds in the limit case.
However, in all the three examples above, $D \ge 100$ is substantially enough for the validity
of the bound. This suggests it can be used to evaluate the performance of FLDA as long
as the dimensionality is moderate.

Fig. 4. Evaluation of the asymptotic generalization bound on Example 3. (Horizontal axis: $\gamma$.)



B. On Real Datasets 

We choose three datasets for real data evaluation, all from the UCI machine learning repository
[31]: 1) the image segmentation (ImageSeg) dataset, which contains 7 classes and in total 2,310
examples from $\mathbb{R}^{19}$; 2) the Landsat dataset, which contains 6 classes and in total 6,435 examples
from $\mathbb{R}^{36}$; and 3) the optical recognition of handwritten digits (Optdigits) dataset, which contains
10 classes and in total 5,620 examples from $\mathbb{R}^{60}$. All these datasets are benchmarks in the
literature of FLDA. On each dataset, we first estimate $\boldsymbol{\Sigma}$ and $\mathbf{S}$ with all data and treat them as
population parameters. Note that for all the three datasets, it holds $N \gg D$, and thus we can
suppose these estimates to be reliable. The experimental procedures are similar to the synthetic
data case, except that the population discrimination powers $\lambda_i$ are calculated with $\boldsymbol{\Sigma}$ and $\mathbf{S}$ rather
than specified by ourselves.

The results of evaluations on the three datasets are shown in Figures 5 to 7. These
results again confirm the validity of the asymptotic generalization bound. However, it is worth
noticing that the tightness of the bound is not as good as in the synthetic data case. This is
because the data from real datasets only occupy a finite set of the entire feature space, and thus
we cannot obtain the almost worst-case realizations of $\delta_i$ provided by Monte Carlo experiments
on synthetic data.

Fig. 5. Evaluation of the asymptotic generalization bound on the ImageSeg dataset. (Horizontal axis: $\gamma$.)

Fig. 6. Evaluation of the asymptotic generalization bound on the Landsat dataset.

Fig. 7. Evaluation of the asymptotic generalization bound on the OptDigits dataset.

VI. Conclusion

FLDA is an important dimension reduction tool in statistical pattern recognition and has been
successfully applied in practice. This paper has studied the asymptotic generalization bound of
FLDA, which is valuable in both theoretical and practical aspects. It shows that the generalization
ability of FLDA is determined by the population discrimination power and the ratio of the
dimensionality to the training sample number. Given the asymptotic generalization bound, we
can decide how many training samples compared to the dimensionality are required to obtain
an acceptable generalization performance of FLDA.
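As a rough illustration of this use, the bound (29) can be inverted numerically: given $\lambda_i$ and a target value for $\eta_i$, a bisection finds the largest admissible ratio $\gamma$, and $N \approx D/\gamma$ then suggests a sample size. The sketch below assumes that $\eta_i$ is non-increasing in $\gamma$ on $[0,1)$, which follows from both arccos terms in (29) growing with $\gamma$; the names `eta` and `max_ratio` and the target value 0.5 are hypothetical. Since the bound is asymptotic, the result is only a guide.

```python
import math

def eta(lam, gamma):
    """Asymptotic lower bound (29)."""
    theta = (math.acos(math.sqrt(lam / (lam + gamma)))
             + math.acos(math.sqrt(1.0 - gamma)))
    return max(math.cos(theta), 0.0) ** 2

def max_ratio(lam, target, tol=1e-10):
    """Largest gamma in [0, 1) with eta(lam, gamma) >= target, found by bisection."""
    lo, hi = 0.0, 1.0                 # eta(lam, 0) = 1 and eta(lam, 1) = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eta(lam, mid) >= target:
            lo = mid
        else:
            hi = mid
    return lo

lam, target, D = 10.0, 0.5, 100       # illustrative values (assumption)
g = max_ratio(lam, target)
print("largest gamma:", g, " suggested N:", math.ceil(D / g))
```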
VII. Technical Proofs

This section provides proofs of the lemmas used in the previous analysis.

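A. Proof of Lemma 1

The proof is divided into two steps.

i) Since $\mathbf{X}^*$ is nonsingular in Proposition 3, we can express $\hat{\mathbf{W}}^*$ as

$$\hat{\mathbf{W}}^* = \mathbf{X}^*\mathbf{Q}, \tag{35}$$

for some $\mathbf{Q} \in \mathbb{R}^{D \times c}$. Then,

$$\begin{aligned}
\Delta(\boldsymbol{\Sigma}, \mathbf{S} \mid \hat{\mathbf{W}}^*) &= \mathrm{Tr}\big((\hat{\mathbf{W}}^{*T}\boldsymbol{\Sigma}\hat{\mathbf{W}}^*)^{-1}\hat{\mathbf{W}}^{*T}\mathbf{S}\hat{\mathbf{W}}^*\big) \\
&= \mathrm{Tr}\big((\mathbf{Q}^T\mathbf{X}^{*T}\boldsymbol{\Sigma}\mathbf{X}^*\mathbf{Q})^{-1}\mathbf{Q}^T\mathbf{X}^{*T}\mathbf{S}\mathbf{X}^*\mathbf{Q}\big) \\
&= \mathrm{Tr}\big((\mathbf{Q}^T\mathbf{Q})^{-1}\mathbf{Q}^T\boldsymbol{\Lambda}\mathbf{Q}\big) \\
&= \mathrm{Tr}\big((\mathbf{Q}^T\mathbf{Q})^{-1}\mathbf{Q}_1^T\boldsymbol{\Lambda}_1\mathbf{Q}_1\big) \\
&= \mathrm{Tr}\big(\mathbf{Q}_1(\mathbf{Q}^T\mathbf{Q})^{-1}\mathbf{Q}_1^T\boldsymbol{\Lambda}_1\big) \\
&= \sum_{i=1}^{c} \delta_i\lambda_i,
\end{aligned} \tag{36}$$

where $\mathbf{Q}_1$ contains the first $c$ rows of $\mathbf{Q}$, $\boldsymbol{\Lambda}_1$ is the first $c \times c$ diagonal block of $\boldsymbol{\Lambda}$, and

$$\delta_i = \{\mathbf{Q}_1(\mathbf{Q}^T\mathbf{Q})^{-1}\mathbf{Q}_1^T\}_{ii}. \tag{37}$$

ii) Similar to Proposition 3, we can augment $\hat{\mathbf{W}}^*$ with some $\hat{\mathbf{V}}^* \in \mathbb{R}^{D \times (D-c)}$ to simultaneously
diagonalize $\hat{\boldsymbol{\Sigma}}$ and $\hat{\mathbf{S}}$, and thus have

$$\hat{\mathbf{W}}^{*T}\hat{\boldsymbol{\Sigma}}\hat{\mathbf{W}}^* = \mathbf{I}_c \quad\text{and}\quad \hat{\mathbf{W}}^{*T}\hat{\mathbf{S}}\hat{\mathbf{W}}^* = \hat{\boldsymbol{\Lambda}}_1, \tag{38}$$

where $\hat{\boldsymbol{\Lambda}}_1$ is some $c \times c$ diagonal matrix. Then, substituting (35) into (38) and recalling
$\hat{\boldsymbol{\Sigma}}_0 = \mathbf{X}^{*T}\hat{\boldsymbol{\Sigma}}\mathbf{X}^*$ and $\hat{\mathbf{S}}_0 = \mathbf{X}^{*T}\hat{\mathbf{S}}\mathbf{X}^*$, we get

$$\mathbf{Q}^T\hat{\boldsymbol{\Sigma}}_0\mathbf{Q} = \mathbf{I}_c \quad\text{and}\quad \mathbf{Q}^T\hat{\mathbf{S}}_0\mathbf{Q} = \hat{\boldsymbol{\Lambda}}_1. \tag{39}$$

Given the eigendecomposition $\hat{\boldsymbol{\Sigma}}_0 = \mathbf{U}\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T$, we have from the first equation in (39) that
there must exist some matrix $\mathbf{O} \in \mathbb{R}^{D \times c}$ with orthonormal columns, $\mathbf{O}^T\mathbf{O} = \mathbf{I}_c$, such that

$$\mathbf{Q} = \mathbf{U}\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{O}. \tag{40}$$

Further, given the eigendecomposition $\hat{\mathbf{S}}_0 = \mathbf{V}\boldsymbol{\Lambda}(\hat{\mathbf{S}}_0)\mathbf{V}^T$, we get from the second equation in
(39) that

$$\mathbf{O}^T\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}\boldsymbol{\Lambda}(\hat{\mathbf{S}}_0)\mathbf{V}^T\mathbf{U}\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{O} = \hat{\boldsymbol{\Lambda}}_1. \tag{41}$$

In addition, since $\hat{\mathbf{S}}_0$ has rank $c$, we can rewrite (41) as

$$\mathbf{O}^T\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\boldsymbol{\Lambda}_1^{1/2}(\hat{\mathbf{S}}_0)\boldsymbol{\Lambda}_1^{1/2}(\hat{\mathbf{S}}_0)\mathbf{V}_{1:c}^T\mathbf{U}\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{O} = \hat{\boldsymbol{\Lambda}}_1, \tag{42}$$

where $\boldsymbol{\Lambda}_1(\hat{\mathbf{S}}_0)$ is the first $c \times c$ diagonal block of $\boldsymbol{\Lambda}(\hat{\mathbf{S}}_0)$. (42) implies that the columns of $\mathbf{O}$
must be the left singular vectors of $\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\boldsymbol{\Lambda}_1^{1/2}(\hat{\mathbf{S}}_0)$. Thus, $\mathbf{O}$ spans the range space
of $\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\boldsymbol{\Lambda}_1^{1/2}(\hat{\mathbf{S}}_0)$ and therefore the range space of $\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}$. Then, there must
exist some matrix $\mathbf{A} \in \mathbb{R}^{c \times c}$ such that $\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c} = \mathbf{O}\mathbf{A}$, and thus

$$\mathbf{O} = \boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\mathbf{A}^{-1}, \tag{43}$$

where the nonsingularity of $\mathbf{A}$ is implied by the nonsingularity of $\boldsymbol{\Lambda}^{-1/2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T$.
By (40) and (43), we have

$$\mathbf{Q} = \mathbf{U}\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\mathbf{A}^{-1}, \tag{44}$$

and

$$\mathbf{Q}_1 = \mathbf{I}_{1:c}^T\mathbf{U}\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\mathbf{A}^{-1}. \tag{45}$$

Therefore,

$$\{\mathbf{Q}_1(\mathbf{Q}^T\mathbf{Q})^{-1}\mathbf{Q}_1^T\}_{ii} = \mathbf{e}_i^T\mathbf{U}\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\big(\mathbf{V}_{1:c}^T\mathbf{U}\boldsymbol{\Lambda}^{-2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\big)^{-1}\mathbf{V}_{1:c}^T\mathbf{U}\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{e}_i. \tag{46}$$

Letting $\mathbf{R} = \mathcal{R}\big(\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\big)$, then

$$\mathbf{R}\mathbf{R}^T = \boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\big(\mathbf{V}_{1:c}^T\mathbf{U}\boldsymbol{\Lambda}^{-2}(\hat{\boldsymbol{\Sigma}}_0)\mathbf{U}^T\mathbf{V}_{1:c}\big)^{-1}\mathbf{V}_{1:c}^T\mathbf{U}\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0), \tag{47}$$

which together with (46) gives

$$\{\mathbf{Q}_1(\mathbf{Q}^T\mathbf{Q})^{-1}\mathbf{Q}_1^T\}_{ii} = \mathbf{e}_i^T\mathbf{U}\mathbf{R}\mathbf{R}^T\mathbf{U}^T\mathbf{e}_i = \|\mathbf{R}^T\mathbf{U}^T\mathbf{e}_i\|^2. \tag{48}$$

This completes the proof.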

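B. Proof of Lemma 2

By Proposition 4, we have

$$\hat{\boldsymbol{\Sigma}}_0 = \frac{1}{N}\sum_{i=1}^{c+1}\sum_{j=1}^{n} (\mathbf{x}_j^i - \bar{\mathbf{x}}^i)(\mathbf{x}_j^i - \bar{\mathbf{x}}^i)^T, \tag{49}$$

where $\mathbf{x}_j^i$ is sampled from $\mathcal{N}(\boldsymbol{\mu}_i, \mathbf{I})$ and $\bar{\mathbf{x}}^i$ is the sample mean of class $i$. Letting $\mathbf{z}_j^i = \mathbf{x}_j^i - \boldsymbol{\mu}_i$, which
means $\mathbf{z}_j^i$ is sampled from the standard Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, then $\hat{\boldsymbol{\Sigma}}_0$ can be rewritten
as

$$\hat{\boldsymbol{\Sigma}}_0 = \frac{1}{N}\sum_{i=1}^{c+1}\sum_{j=1}^{n} (\mathbf{z}_j^i - \bar{\mathbf{z}}^i)(\mathbf{z}_j^i - \bar{\mathbf{z}}^i)^T, \tag{50}$$

with $\bar{\mathbf{z}}^i = \frac{1}{n}\sum_{j=1}^{n}\mathbf{z}_j^i \sim \mathcal{N}(\mathbf{0}, \frac{1}{n}\mathbf{I})$. One property of $\hat{\boldsymbol{\Sigma}}_0$ in (50) is that, as a random variable, its
distribution is invariant to orthogonal similarity transformation, i.e., $\hat{\boldsymbol{\Sigma}}_0$ and $\mathbf{O}\hat{\boldsymbol{\Sigma}}_0\mathbf{O}^T$, where
$\mathbf{O}^T\mathbf{O} = \mathbf{I}$, have the same distribution. This is a result of the fact that $\mathbf{O}\hat{\boldsymbol{\Sigma}}_0\mathbf{O}^T$ corresponds to (50)
in the case of replacing $\mathbf{z}_j^i$ by $\mathbf{O}\mathbf{z}_j^i$, and $\mathbf{O}\mathbf{z}_j^i$ has the same distribution as $\mathbf{z}_j^i$, i.e., the standard
Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. Then, according to Theorem 3.2 in [32], due to the invariance to
orthogonal similarity transformation, the distribution of $\hat{\boldsymbol{\Sigma}}_0$ is independent of its eigenvectors $\mathbf{U}$
and depends only on its eigenvalues $\boldsymbol{\Lambda}(\hat{\boldsymbol{\Sigma}}_0)$, and thus $\mathbf{U}$ is a random variable uniformly
distributed on the set of all orthonormal matrices. This proves statements 1) and
2) in Lemma 2.

In addition, (50) can be rewritten as

$$\begin{aligned}
\hat{\boldsymbol{\Sigma}}_0 &= \frac{1}{N}\sum_{i=1}^{c+1}\sum_{j=1}^{n} \mathbf{z}_j^i\mathbf{z}_j^{iT} - \frac{1}{c+1}\sum_{i=1}^{c+1} \bar{\mathbf{z}}^i\bar{\mathbf{z}}^{iT} \\
&= \frac{1}{N}\sum_{i=1}^{c+1}\sum_{j=1}^{n} \mathbf{z}_j^i\mathbf{z}_j^{iT} - \frac{1}{(c+1)n}\sum_{i=1}^{c+1} (\sqrt{n}\,\bar{\mathbf{z}}^i)(\sqrt{n}\,\bar{\mathbf{z}}^i)^T \\
&= \frac{1}{N}\mathbf{G}_1\mathbf{G}_1^T - \frac{1}{N}\mathbf{G}_2\mathbf{G}_2^T \\
&= \mathbf{T}_1 + \mathbf{T}_2,
\end{aligned} \tag{51}$$

where $\mathbf{G}_1 \in \mathbb{R}^{D \times N}$, $\mathbf{G}_2 \in \mathbb{R}^{D \times (c+1)}$, and both have entries i.i.d. from $\mathcal{N}(0,1)$. For the first
term $\mathbf{T}_1 = \frac{1}{N}\mathbf{G}_1\mathbf{G}_1^T$, by Proposition 1, we know that the empirical distribution of its eigenvalues
converges almost surely to $F_\gamma(\lambda)$ with density

$$\frac{dF_\gamma(\lambda)}{d\lambda} = \frac{1}{2\pi\gamma\lambda}\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \quad \lambda_- \le \lambda \le \lambda_+, \tag{52}$$

where $\gamma = D/N$ and

$$\lambda_+ = (1+\sqrt{\gamma})^2 \quad\text{and}\quad \lambda_- = (1-\sqrt{\gamma})^2. \tag{53}$$

For the second term $\mathbf{T}_2 = -\frac{1}{N}\mathbf{G}_2\mathbf{G}_2^T$, clearly it has finite rank $c+1$. According to [33], a
finite-rank perturbation does not affect the convergence of the empirical spectral distribution, i.e.,
$\lim F_N(\lambda(\mathbf{T}_1 + \mathbf{T}_2)) = \lim F_N(\lambda(\mathbf{T}_1)) = F_\gamma(\lambda)$. This completes the proof.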

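C. Proof of Lemma 3

The condition that $\boldsymbol{\xi}$ is a unit-length random vector uniformly distributed on the unit sphere
$\mathbb{S}^{D-1}$ can be replaced by $\boldsymbol{\xi} \in \mathbb{R}^D$ with entries independently sampled from $\mathcal{N}(0, 1/D)$. This is
because, in the latter case, $\boldsymbol{\xi}/\|\boldsymbol{\xi}\|$ is uniformly distributed on $\mathbb{S}^{D-1}$, and in the limit $\|\boldsymbol{\xi}\|^2 \to 1$
almost surely due to the strong law of large numbers.

For (24), we divide the proof into two steps. First, we show that
$\boldsymbol{\xi}^T\boldsymbol{\Lambda}^{-1}(\hat{\boldsymbol{\Sigma}}_0)\boldsymbol{\xi} \xrightarrow{\text{a.s.}} \int \lambda^{-1}\,dF_\gamma(\lambda)$,
and then we calculate the integral.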

i) Recall A_ = (1 — y/l) 2 , and let A 1 (S ) = diag(min{A_, A~ 1 (S )}), i.e., a truncated version 
of A _1 (E ) by clamping to be Al 1 if \7 l {%) > XZ 1 . Then, we divide the lefthand 
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side of (|24j) into three terms 



(54) 



e T A" i (S )e--Tr(A- i (S )), 



(55) 



and 



D 



Tr(A (S )) - / \~ dFy(X) 



Then, we show that all the three terms converge almost surely to zero. 



— i 



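For the first term (54), we have

$$0 \le \xi^T\big(\Lambda^{-1}(\Sigma_0) - \Lambda_\alpha^{-1}(\Sigma_0)\big)\xi \le \|\xi\|^2 \max\{0,\ \lambda_{\min}^{-1}(\Sigma_0) - \lambda_-^{-1}\}. \tag{57}$$

By (??) and the arguments in the proof of Lemma [2], we know that

$$\lim \lambda_{\min}(\Sigma_0) = \lim \lambda_{\min}\Big(\frac{1}{N}ZZ^T\Big) = \lim \sigma_{\min}^2\Big(\frac{Z}{\sqrt{N}}\Big), \tag{58}$$

where $Z \in \mathbb{R}^{D\times N}$ has entries independently sampled from $\mathcal{N}(0,1)$. By Proposition [2], we have $\lim \sigma_{\min}(Z/\sqrt{N}) = 1-\sqrt{\gamma}$ almost surely. Thus, $\lambda_{\min}(\Sigma_0) \xrightarrow{a.s.} (1-\sqrt{\gamma})^2 = \lambda_-$. Accordingly,

$$\max\{0,\ \lambda_{\min}^{-1}(\Sigma_0) - \lambda_-^{-1}\} \xrightarrow{a.s.} 0. \tag{59}$$

Then, by $\|\xi\|^2 \xrightarrow{a.s.} 1$, (57) and (59), we have

$$\xi^T\Lambda^{-1}(\Sigma_0)\xi - \xi^T\Lambda_\alpha^{-1}(\Sigma_0)\xi \xrightarrow{a.s.} 0. \tag{60}$$

For the second term (55), since $\|\Lambda_\alpha^{-1}(\Sigma_0)\| \le \lambda_-^{-1}$ for all $D$, i.e., it is uniformly bounded, we apply Theorem 3.4 in [29] and get

$$\xi^T\Lambda_\alpha^{-1}(\Sigma_0)\xi - \frac{1}{D}\mathrm{Tr}(\Lambda_\alpha^{-1}(\Sigma_0)) \xrightarrow{a.s.} 0. \tag{61}$$

For the third term (56), since $dF_\gamma(\lambda)$ is nonzero only on $[\lambda_-, \lambda_+]$, we have

$$\begin{aligned}
&\frac{1}{D}\mathrm{Tr}(\Lambda_\alpha^{-1}(\Sigma_0)) - \int \lambda^{-1}\,dF_\gamma(\lambda) \\
&= \int \min(\lambda_-^{-1}, \lambda^{-1})\,dF_N(\lambda) - \int_{\lambda_-}^{\lambda_+} \lambda^{-1}\,dF_\gamma(\lambda) \\
&= \int_{\lambda_-}^{\lambda_+} \lambda^{-1}\,d\big(F_N(\lambda) - F_\gamma(\lambda)\big) + \lambda_-^{-1}\int_0^{\lambda_-} dF_N(\lambda) + \int_{\lambda_+}^{\infty} \lambda^{-1}\,dF_N(\lambda).
\end{aligned} \tag{62}$$

Since $F_N(\lambda) \xrightarrow{a.s.} F_\gamma(\lambda)$ and $\lambda^{-1}$ is bounded on $[\lambda_-, \lambda_+]$, it holds [21]

$$\int_{\lambda_-}^{\lambda_+} \lambda^{-1}\,d\big(F_N(\lambda) - F_\gamma(\lambda)\big) \xrightarrow{a.s.} 0. \tag{63}$$

Further, since $F_\gamma(\lambda_-) = 0$ and $F_\gamma(\lambda_+) = 1$, it holds

$$\int_0^{\lambda_-} dF_N(\lambda) = F_N(\lambda_-) \xrightarrow{a.s.} F_\gamma(\lambda_-) = 0, \tag{64}$$

and

$$0 \le \int_{\lambda_+}^{\infty} \lambda^{-1}\,dF_N(\lambda) \le \lambda_+^{-1}\big(1 - F_N(\lambda_+)\big) \xrightarrow{a.s.} \lambda_+^{-1}\big(1 - F_\gamma(\lambda_+)\big) = 0. \tag{65}$$

Thus,

$$\frac{1}{D}\mathrm{Tr}(\Lambda_\alpha^{-1}(\Sigma_0)) - \int \lambda^{-1}\,dF_\gamma(\lambda) \xrightarrow{a.s.} 0. \tag{66}$$

ii) We now calculate the integral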



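$$I = \int \lambda^{-1}\,dF_\gamma(\lambda) = \int_{\lambda_-}^{\lambda_+} \frac{\sqrt{(\lambda_+-\lambda)(\lambda-\lambda_-)}}{2\pi\gamma\lambda^2}\,d\lambda, \tag{67}$$

where $\lambda_+ = (1+\sqrt{\gamma})^2$ and $\lambda_- = (1-\sqrt{\gamma})^2$. Letting $\lambda = 1+\gamma-2\sqrt{\gamma}\cos x$, $x \in [0,\pi]$, and substituting it into (67), we have

$$I = \frac{2}{\pi}\int_0^{\pi} \frac{\sin^2 x}{(1+\gamma-2\sqrt{\gamma}\cos x)^2}\,dx. \tag{68}$$

Further, letting $t = \tan\frac{x}{2}$, we have

$$\begin{aligned}
I &= \frac{2}{\pi}\int_0^{\infty} \frac{\left(\frac{2t}{1+t^2}\right)^2}{\left(1+\gamma-2\sqrt{\gamma}\,\frac{1-t^2}{1+t^2}\right)^2}\,\frac{2}{1+t^2}\,dt \\
&= \frac{16}{\pi}\int_0^{\infty} \frac{t^2}{\big((1+\gamma)(t^2+1)-2\sqrt{\gamma}(1-t^2)\big)^2}\,\frac{1}{1+t^2}\,dt \\
&= \frac{16}{\pi}\int_0^{\infty} \frac{t^2}{\big((1+\sqrt{\gamma})^2t^2+(1-\sqrt{\gamma})^2\big)^2}\,\frac{1}{1+t^2}\,dt \\
&= \frac{16}{\pi(1+\sqrt{\gamma})^4}\int_0^{\infty} \frac{t^2}{\Big(t^2+\big(\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}\big)^2\Big)^2}\,\frac{1}{1+t^2}\,dt.
\end{aligned} \tag{69}$$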



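Letting $a = \frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}$ and by partial fraction, we have

$$\int_0^{\infty} \frac{t^2}{(t^2+a^2)^2}\,\frac{1}{1+t^2}\,dt = \int_0^{\infty} \frac{-\frac{1}{(1-a^2)^2}}{t^2+1}\,dt + \int_0^{\infty} \frac{\frac{1}{(1-a^2)^2}}{t^2+a^2}\,dt + \int_0^{\infty} \frac{-\frac{a^2}{1-a^2}}{(t^2+a^2)^2}\,dt. \tag{70}$$

Denoting by $I_1$, $I_2$ and $I_3$ the terms on the right-hand side of (70), we have

$$I_1 = \int_0^{\infty} \frac{-\frac{1}{(1-a^2)^2}}{t^2+1}\,dt = \frac{-1}{(1-a^2)^2}\int_0^{\infty} d\arctan t = \frac{-\pi}{2(1-a^2)^2}, \tag{71}$$

$$I_2 = \int_0^{\infty} \frac{\frac{1}{(1-a^2)^2}}{t^2+a^2}\,dt = \frac{1}{a(1-a^2)^2}\int_0^{\infty} d\arctan\frac{t}{a} = \frac{\pi}{2a(1-a^2)^2}, \tag{72}$$

$$\begin{aligned}
I_3 &= \int_0^{\infty} \frac{-\frac{a^2}{1-a^2}}{(t^2+a^2)^2}\,dt
= \frac{-1}{2(1-a^2)}\int_0^{\infty} d\,\frac{t}{t^2+a^2} + \frac{-1}{2(1-a^2)}\int_0^{\infty} \frac{1}{t^2+a^2}\,dt \\
&= 0 - \frac{\pi}{4a(1-a^2)} = \frac{-\pi}{4a(1-a^2)}.
\end{aligned} \tag{73}$$

Combining (69) to (73) and noticing $a = \frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}$, we get

$$\begin{aligned}
I &= \frac{16}{\pi(1+\sqrt{\gamma})^4}\left(\frac{-\pi}{2(1-a^2)^2} + \frac{\pi}{2a(1-a^2)^2} + \frac{-\pi}{4a(1-a^2)}\right) \\
&= \frac{16}{\pi(1+\sqrt{\gamma})^4}\cdot\frac{\pi}{4a(1+a)^2} = \frac{1}{1-\gamma}.
\end{aligned} \tag{74}$$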



This completes the proof of (24). 
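
The closed form in (74) can be sanity-checked numerically. The following sketch evaluates the integral in the trigonometric form (68) with a plain midpoint rule; the function name `mp_inverse_moment`, the grid size `n`, and the test values of $\gamma$ are illustrative choices made here.

```python
import math

def mp_inverse_moment(gamma, power, n=20000):
    """Midpoint-rule value of int lambda^{-power} dF_gamma(lambda).

    Uses lambda = 1 + gamma - 2 sqrt(gamma) cos x, x in [0, pi], under which
    the integral becomes (2/pi) * int_0^pi sin^2(x) / lambda^(power+1) dx.
    """
    h = math.pi / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        lam = 1 + gamma - 2 * math.sqrt(gamma) * math.cos(x)
        total += math.sin(x) ** 2 / lam ** (power + 1)
    return 2 / math.pi * total * h

for gamma in (0.1, 0.5, 0.9):
    numeric = mp_inverse_moment(gamma, 1)
    print(f"gamma={gamma}: numeric={numeric:.6f}, 1/(1-gamma)={1 / (1 - gamma):.6f}")
```

Setting `power=0` returns the total mass of $F_\gamma$, which should be 1 and gives a quick check of the substitution itself.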



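For (25), by the same strategy as used in the proof of (24), we have $\xi^T\Lambda^{-2}(\Sigma_0)\xi \xrightarrow{a.s.} \int \lambda^{-2}\,dF_\gamma(\lambda)$. Below, we calculate the integral

$$I = \int \lambda^{-2}\,dF_\gamma(\lambda) = \int_{\lambda_-}^{\lambda_+} \frac{\sqrt{(\lambda_+-\lambda)(\lambda-\lambda_-)}}{2\pi\gamma\lambda^3}\,d\lambda, \tag{75}$$

where $\lambda_+ = (1+\sqrt{\gamma})^2$ and $\lambda_- = (1-\sqrt{\gamma})^2$. Letting $\lambda = 1+\gamma-2\sqrt{\gamma}\cos x$, $x \in [0,\pi]$, and substituting it into (75), we have

$$I = \frac{2}{\pi}\int_0^{\pi} \frac{\sin^2 x}{(1+\gamma-2\sqrt{\gamma}\cos x)^3}\,dx. \tag{76}$$

Further, letting $t = \tan\frac{x}{2}$, we have

$$\begin{aligned}
I &= \frac{2}{\pi}\int_0^{\infty} \frac{\left(\frac{2t}{1+t^2}\right)^2}{\left(1+\gamma-2\sqrt{\gamma}\,\frac{1-t^2}{1+t^2}\right)^3}\,\frac{2}{1+t^2}\,dt \\
&= \frac{16}{\pi}\int_0^{\infty} \frac{t^2}{\big((1+\gamma)(t^2+1)-2\sqrt{\gamma}(1-t^2)\big)^3}\,dt \\
&= \frac{16}{\pi}\int_0^{\infty} \frac{t^2}{\big((1+\sqrt{\gamma})^2t^2+(1-\sqrt{\gamma})^2\big)^3}\,dt \\
&= \frac{16}{\pi(1+\sqrt{\gamma})^6}\int_0^{\infty} \frac{t^2}{\Big(t^2+\big(\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}\big)^2\Big)^3}\,dt.
\end{aligned} \tag{77}$$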



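Letting $a = \frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}$, we have

$$\int_0^{\infty} \frac{t^2}{(t^2+a^2)^3}\,dt = -\frac{1}{4}\int_0^{\infty} t\,d\,\frac{1}{(t^2+a^2)^2} = \frac{1}{4}\int_0^{\infty} \frac{1}{(t^2+a^2)^2}\,dt = \frac{\pi}{16a^3}, \tag{78}$$

where the boundary term of the integration by parts vanishes. Thus, by $a = \frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}$, we get

$$I = \frac{16}{\pi(1+\sqrt{\gamma})^6}\cdot\frac{\pi}{16a^3} = \frac{1}{(1-\gamma)^3}.$$

This completes the proof of (25).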
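
As a hedged numerical cross-check of (78) and of the final value $1/(1-\gamma)^3$, the sketch below evaluates the $t$-integral after the auxiliary substitution $t = \tan\varphi$, which maps $[0,\infty)$ to $[0,\pi/2)$ and turns the integrand into $\sin^2\varphi\cos^2\varphi/(\sin^2\varphi + a^2\cos^2\varphi)^3$. This substitution, the function names, and the grid size are choices made here for illustration and are not part of the proof.

```python
import math

def tail_integral(a, n=20000):
    """Midpoint rule for int_0^inf t^2 / (t^2 + a^2)^3 dt via t = tan(phi)."""
    h = (math.pi / 2) / n
    total = 0.0
    for k in range(n):
        phi = (k + 0.5) * h
        s, c = math.sin(phi), math.cos(phi)
        total += s * s * c * c / (s * s + a * a * c * c) ** 3
    return total * h

def second_inverse_moment(gamma):
    """Value of int lambda^{-2} dF_gamma(lambda) assembled as in (77)."""
    r = math.sqrt(gamma)
    a = (1 - r) / (1 + r)
    return 16 / (math.pi * (1 + r) ** 6) * tail_integral(a)

gamma = 0.5
print(f"tail integral: {tail_integral(0.5):.8f}  vs  pi/(16 a^3) = {math.pi / (16 * 0.5 ** 3):.8f}")
print(f"moment: {second_inverse_moment(gamma):.8f}  vs  1/(1-gamma)^3 = {1 / (1 - gamma) ** 3:.8f}")
```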



D. Proof of Lemma [4]

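Suppose the original distributions of the $c+1$ classes are $\mathcal{N}(\mu_i, \Sigma)$, $i = 1, 2, \ldots, c+1$, and the between-class scatter matrix is $S$. Since $X^*$ simultaneously diagonalizes $\Sigma$ and $S$, the normalized covariance $I$ and between-class scatter matrix $\Lambda$ correspond to distributions $\mathcal{N}(\mu_i', I)$, wherein $\mu_i' = X^{*T}\mu_i$, and $\Lambda = \frac{1}{c+1}\sum_{i=1}^{c+1}(\mu_i' - \mu')(\mu_i' - \mu')^T$, with $\mu' = \frac{1}{c+1}\sum_{i=1}^{c+1}\mu_i'$. Letting $M = [\mu_1', \ldots, \mu_{c+1}']$ and $E \in \mathbb{R}^{(c+1)\times(c+1)}$ with all entries equal to $\frac{1}{c+1}$, we have $\Lambda = \frac{1}{c+1}M(I-E)(I-E)^TM^T$. Similarly, for the sample version we have $\hat{S} = \frac{1}{c+1}\hat{M}(I-E)(I-E)^T\hat{M}^T$, with $\hat{M} = [\hat{\mu}_1', \ldots, \hat{\mu}_{c+1}']$. Since there are $n$ training examples for each class, we have $\hat{M} = M + X$, wherein the entries of $X \in \mathbb{R}^{D\times(c+1)}$ are i.i.d. samples from $\mathcal{N}(0, 1/n)$.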

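Note that the nonzero diagonal entries of $\Lambda$ are $\lambda_i$, $i = 1, 2, \ldots, c$, with eigenvectors $e_i$, $i = 1, 2, \ldots, c$. Then, $\Lambda = \frac{1}{c+1}M(I-E)(I-E)^TM^T$ implies that $M(I-E)$ has singular values $\sqrt{(c+1)\lambda_i}$, $i = 1, 2, \ldots, c$, and left singular vectors $I_{1:c} = [e_1, \ldots, e_c]$. Denoting by $Q \in \mathbb{R}^{(c+1)\times c}$ the right singular vectors, $Q^TQ = I_c$, we have

$$M(I-E)Q = \big[\sqrt{(c+1)\lambda_1}\,e_1, \ldots, \sqrt{(c+1)\lambda_c}\,e_c\big]. \tag{79}$$

Consequently, by $\hat{M} = M + X$, we have

$$\hat{M}(I-E)Q = \big[\sqrt{(c+1)\lambda_1}\,e_1, \ldots, \sqrt{(c+1)\lambda_c}\,e_c\big] + X(I-E)Q = [\xi_1, \ldots, \xi_c], \tag{80}$$

where $\xi_i = \sqrt{(c+1)\lambda_i}\,e_i + X(I-E)Q_i$, $i = 1, 2, \ldots, c$. Then, by $\hat{S} = \frac{1}{c+1}\hat{M}(I-E)(I-E)^T\hat{M}^T$, we have for the first $c$ eigenvectors $V_{1:c}$ of $\hat{S}$ that

$$V_{1:c} = \mathcal{R}(\hat{M}(I-E)) = \mathcal{R}(\hat{M}(I-E)Q) = \mathcal{R}([\xi_1, \ldots, \xi_c]). \tag{81}$$

Thus,

$$\begin{aligned}
\|V_{1:c}^Te_i\| &= \|\mathcal{R}^T([\xi_1, \ldots, \xi_c])\,e_i\| \ge \frac{|e_i^T\xi_i|}{\|\xi_i\|} \\
&= \frac{\big|e_i^T\sqrt{(c+1)\lambda_i}\,e_i + e_i^TX(I-E)Q_i\big|}{\big\|\sqrt{(c+1)\lambda_i}\,e_i + X(I-E)Q_i\big\|} \\
&\ge \frac{\sqrt{(c+1)\lambda_i} - |e_i^TX(I-E)Q_i|}{\sqrt{(c+1)\lambda_i} + \|X(I-E)Q_i\|}.
\end{aligned} \tag{82}$$

It can be verified that as $N = (c+1)n \to \infty$,

$$|e_i^TX(I-E)Q_i| \le \|e_i^TX\| = \sqrt{\sum_{j=1}^{c+1} X_{ij}^2} \xrightarrow{a.s.} 0, \tag{83}$$

where the inequality is due to $\|(I-E)Q_i\| \le \|I-E\|\,\|Q_i\| \le 1$ and the limit is because $X_{ij}$ follows the distribution $\mathcal{N}(0, \frac{1}{n})$.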

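In addition, by Proposition [2] and letting $G = \sqrt{n}X$, we have

$$\|X\| = \frac{1}{\sqrt{n}}\|G\| = \sqrt{\frac{D}{n}}\cdot\frac{\|G\|}{\sqrt{D}} \xrightarrow{a.s.} \sqrt{(c+1)\gamma}, \tag{84}$$

since $\|G\|/\sqrt{D} \to 1$ for fixed $c$ and $D/n = (c+1)D/N \to (c+1)\gamma$. Thus,

$$\|X(I-E)Q_i\| \le \|X\| \xrightarrow{a.s.} \sqrt{(c+1)\gamma}. \tag{85}$$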



Combining (82), (83) and (85), we obtain

$$\lim_{D/N\to\gamma} \|V_{1:c}^Te_i\| \ge \frac{\sqrt{\lambda_i}}{\sqrt{\lambda_i}+\sqrt{\gamma}}, \quad \text{a.s.} \tag{86}$$

This completes the proof.
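
The construction in this proof can be simulated directly. The sketch below is an illustration under assumptions made here: the sizes `D`, `c`, `n`, the eigenvalues `lambdas`, and the seed are arbitrary, and the lower bound is taken in the form $\sqrt{\lambda_i}/(\sqrt{\lambda_i}+\sqrt{\gamma})$ that follows from combining (82), (83) and (85). The class-mean matrix $M$ is built so that $\frac{1}{c+1}M(I-E)(I-E)^TM^T$ is diagonal with the chosen eigenvalues.

```python
import numpy as np

def lemma4_check(D=500, c=3, n=250, lambdas=(2.0, 1.0, 0.5), seed=0):
    """Return (||V_{1:c}^T e_i||, lower bound) for i = 1..c in one random draw."""
    rng = np.random.default_rng(seed)
    N = (c + 1) * n
    gamma = D / N
    lam = np.asarray(lambdas, dtype=float)
    E = np.full((c + 1, c + 1), 1.0 / (c + 1))
    centering = np.eye(c + 1) - E

    # Q: orthonormal columns orthogonal to the all-ones vector, so (I - E) Q = Q.
    Q, _ = np.linalg.qr(centering @ rng.standard_normal((c + 1, c)))

    # M chosen so that M (I - E) Q = [sqrt((c+1) lam_i) e_i], as in (79).
    B = np.zeros((D, c))
    B[np.arange(c), np.arange(c)] = np.sqrt((c + 1) * lam)
    M = B @ Q.T

    X = rng.standard_normal((D, c + 1)) / np.sqrt(n)  # entries N(0, 1/n)
    M_hat = M + X

    # Leading c left singular vectors span the range used in (81).
    U, _, _ = np.linalg.svd(M_hat @ centering, full_matrices=False)
    V = U[:, :c]

    proj = np.linalg.norm(V[:c, :], axis=1)  # row i of V equals V^T e_i
    bound = np.sqrt(lam) / (np.sqrt(lam) + np.sqrt(gamma))
    return proj, bound

proj, bound = lemma4_check()
for p, b in zip(proj, bound):
    print(f"||V^T e_i|| = {p:.3f}   lower bound = {b:.3f}")
```

In such draws the projections typically sit well above the bound, which is expected because the triangle inequality used in the last step of (82) is not tight.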



E. Proof of Lemma [3] 

Recall from the earlier lemma that $\delta_i = \|\mathcal{R}^T(\Lambda^{-1}(\Sigma_0)U^TV_{1:c})\,U^Te_i\|^2$. Denoting by $\angle(U^Te_i, \mathcal{R}(\Lambda^{-1}(\Sigma_0)U^TV_{1:c}))$ the angle between the vector $U^Te_i$ and the subspace $\mathcal{R}(\Lambda^{-1}(\Sigma_0)U^TV_{1:c})$, we have

$$\delta_i = \cos^2\big(\angle(U^Te_i, \mathcal{R}(\Lambda^{-1}(\Sigma_0)U^TV_{1:c}))\big). \tag{87}$$

Two basic facts that hold for arbitrary vectors $x_1$, $x_2$ and subspace $\mathcal{X}$ are

$$\angle(x_1, \mathcal{X}) \le \angle(x_1, x_2) + \angle(x_2, \mathcal{X}), \tag{88}$$

and

$$\angle(x_1, \mathcal{X}) \le \angle(x_1, x), \quad \text{if } x \in \mathcal{X}. \tag{89}$$
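
Both facts are easy to probe numerically. The sketch below is an illustrative check, not part of the proof: the helper names, the ambient dimension 6, the subspace dimension 2, the number of trials, and the tolerance are all choices made here. It measures the angle between a vector and a subspace through the norm of the orthogonal projection.

```python
import numpy as np

def angle(u, v):
    """Angle in [0, pi] between two nonzero vectors."""
    cos_uv = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos_uv, -1.0, 1.0)))

def angle_to_subspace(u, basis):
    """Angle in [0, pi/2] between u and the column space of `basis`."""
    q, _ = np.linalg.qr(basis)  # orthonormal basis of the subspace
    cos_ux = np.linalg.norm(q.T @ u) / np.linalg.norm(u)
    return float(np.arccos(np.clip(cos_ux, 0.0, 1.0)))

rng = np.random.default_rng(0)
tol = 1e-7
violations = 0
for _ in range(1000):
    basis = rng.standard_normal((6, 2))    # a random 2-D subspace of R^6
    x1, x2 = rng.standard_normal(6), rng.standard_normal(6)
    x_in = basis @ rng.standard_normal(2)  # an arbitrary vector inside the subspace
    a1 = angle_to_subspace(x1, basis)
    if a1 > angle(x1, x2) + angle_to_subspace(x2, basis) + tol:
        violations += 1                    # would contradict (88)
    if a1 > angle(x1, x_in) + tol:
        violations += 1                    # would contradict (89)
print("violations:", violations)
```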



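Then, by using (88) and (89), we get

$$\begin{aligned}
&\angle\big(U^Te_i, \mathcal{R}(\Lambda^{-1}(\Sigma_0)U^TV_{1:c})\big) \\
&\le \angle\big(U^Te_i, U^TV_{1:c}V_{1:c}^Te_i\big) + \angle\big(U^TV_{1:c}V_{1:c}^Te_i, \mathcal{R}(\Lambda^{-1}(\Sigma_0)U^TV_{1:c})\big) \\
&\le \angle\big(U^Te_i, U^TV_{1:c}V_{1:c}^Te_i\big) + \angle\big(U^TV_{1:c}V_{1:c}^Te_i, \Lambda^{-1}(\Sigma_0)U^TV_{1:c}V_{1:c}^Te_i\big) \\
&= \theta_1 + \theta_2.
\end{aligned} \tag{90}$$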



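Denoting $\theta = \theta_1 + \theta_2$, since $\cos(x)$ is positive and decreasing on $[0, \pi/2]$, $x^2$ is increasing on $[0, 1]$, and $\delta_i$ is nonnegative, we have

$$\delta_i \ge \begin{cases} \cos^2(\theta), & \theta \le \frac{\pi}{2} \\ 0, & \text{else} \end{cases} \;=\; {\max}^2\{\cos(\theta), 0\}. \tag{91}$$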

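It remains to calculate $\theta_1$ and $\theta_2$. For $\theta_1$, we have

$$\cos^2(\theta_1) = \frac{|e_i^TUU^TV_{1:c}V_{1:c}^Te_i|^2}{\|U^TV_{1:c}V_{1:c}^Te_i\|^2} = \frac{|e_i^TV_{1:c}V_{1:c}^Te_i|^2}{e_i^TV_{1:c}V_{1:c}^Te_i} = \|V_{1:c}^Te_i\|^2, \tag{92}$$

which gives

$$\theta_1 = \arccos(\|V_{1:c}^Te_i\|). \tag{93}$$

For $\theta_2$, as rescaling does not change the direction of a vector, we can rewrite $\theta_2$ as

$$\theta_2 = \angle\big(U^T\zeta, \Lambda^{-1}(\Sigma_0)U^T\zeta\big), \tag{94}$$

where

$$\zeta = \frac{V_{1:c}V_{1:c}^Te_i}{\|V_{1:c}V_{1:c}^Te_i\|}. \tag{95}$$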



Note that $\zeta$ is a unit-length random vector and is independent of $U$ due to the independence between $V_{1:c}$ and $U$. Then, we have

$$\cos^2(\theta_2) = \frac{|\zeta^TU\Lambda^{-1}(\Sigma_0)U^T\zeta|^2}{\|\Lambda^{-1}(\Sigma_0)U^T\zeta\|^2} = \frac{\big(\zeta^TU\Lambda^{-1}(\Sigma_0)U^T\zeta\big)^2}{\zeta^TU\Lambda^{-2}(\Sigma_0)U^T\zeta}. \tag{96}$$

We know from Lemma [2] that $U$ is uniformly distributed on the set of all orthonormal matrices in $\mathbb{R}^{D\times D}$, and $\zeta$ is a unit-length random vector independent of $U$. Thus, $\xi = U^T\zeta$ must be a unit-length random vector uniformly distributed on the unit sphere $\mathbb{S}^{D-1}$. Finally, (96) gives

$$\theta_2 = \arccos\left(\frac{\xi^T\Lambda^{-1}(\Sigma_0)\xi}{\sqrt{\xi^T\Lambda^{-2}(\Sigma_0)\xi}}\right). \tag{97}$$

This completes the proof.
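
To see how the pieces fit together, the following sketch plugs the limits (24) and (25) into (97), which gives $\theta_2 \to \arccos\sqrt{1-\gamma}$, and then evaluates the lower bound (91) for a given value of $\|V_{1:c}^Te_i\|$. The function names and the sample inputs are hypothetical and chosen here only for illustration.

```python
import math

def theta2_limit(gamma):
    """Limit of theta_2 in (97) when the limits (24) and (25) are substituted."""
    numerator = 1 / (1 - gamma)                     # limit of xi^T Lambda^{-1} xi, (24)
    denominator = math.sqrt(1 / (1 - gamma) ** 3)   # sqrt of the limit in (25)
    return math.acos(min(1.0, numerator / denominator))

def delta_lower_bound(v_norm, gamma):
    """Evaluate max^2{cos(theta_1 + theta_2), 0} from (91) and (93)."""
    theta = math.acos(v_norm) + theta2_limit(gamma)
    return max(math.cos(theta), 0.0) ** 2

for v_norm, gamma in [(1.0, 0.0), (1.0, 0.75), (0.5, 0.75), (0.9, 0.2)]:
    print(f"||V^T e_i||={v_norm}, gamma={gamma}: bound={delta_lower_bound(v_norm, gamma):.4f}")
```

The bound equals 1 only when $\gamma = 0$ and $\|V_{1:c}^Te_i\| = 1$, and it drops to 0 once $\theta_1 + \theta_2$ exceeds $\pi/2$.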